ABSTRACT

There are few existing or widely known measures of
agreement applicable when data are nominal or categorical. Most such
coefficients are applicable only when judges classify objects or
subjects into a single category. A wider range of applications,
including those where judges (1) place probabilities on subjects
belonging to mutually exclusive and exhaustive nominal categories, or
(2) rank order the applicability of categories to subjects, is
desirable. A generalized ANOVA model is presented which allows the
estimation of various reliability coefficients of interest for all
classification tasks described. (Author)
ABSTRACT

A GENERALIZED ANOVA MODEL FOR ESTIMATING
THE RELIABILITY OF CATEGORICAL JUDGMENTS

John M. Enger and Douglas R. Whitney

There are few existing or widely known measures of agreement applicable
when data are nominal or categorical. Most such coefficients are applicable
only when judges classify objects or subjects into a single category.
A wider range of applications is desirable to include those where judges
(1) place probabilities on subjects belonging to mutually exclusive and
exhaustive nominal categories, (2) rank order the applicability of categories
to subjects, or (3) assign weights to the appropriateness of the placement
of subjects into each of a set of categories. A generalized ANOVA model is
presented which allows the estimation of various reliability coefficients
of interest for all assignment tasks described.
A GENERALIZED ANOVA MODEL FOR ESTIMATING
THE RELIABILITY OF CATEGORICAL JUDGMENTS

Situations frequently arise in research activities which require that
the experimenter obtain from judges estimates of the degree to which each of
a set of objects belongs to each of a prescribed set of classes or categories.
Conventionally, the degree of agreement among judges is presented as a
measure of the reliability of the judgments. In the familiar case in which
the ratings solicited from judges represent an amount of a single trait or
characteristic, analysis of variance techniques have provided useful reliability
indices (e.g., Ebel, 1951). The situations of concern in this paper, however,
are of a different nature. Our concern is with situations in which each
judge is required to simultaneously evaluate the degree to which an object
possesses a specified set of traits or characteristics.

Four recent examples will serve to illustrate this kind of judgmental
activity:

1. Pyrczak and Rasmussen (1973) asked two judges to classify each of
   52 items on a standardized reading test into one of seven categories.

2. Board and Whitney (1972) asked six judges to classify each of
   20 multiple-choice test items into one of five categories according
   to whether any of four poor item-writing practices had been used.
   (The fifth category was "no flaws.")

3. Robinson (1974) asked three judges to allocate to each of 232 test
   items, from the Iowa Tests of Basic Skills, integer values of
   0, 1 or 2 according to the degree to which each item required the
   use of three cognitive learning styles (relational, descriptive,
   categorical).

4. Enger (1975) asked 16 students in an educational psychology class
   to allocate to each of 25 test items, from a standardized test in
   educational psychology, integer values of 0-9 according to the
   degree to which each item represented three major content areas
   of the course. Other assignment tasks investigated included those
   where students placed their probabilities as to the appropriateness
   of the placement of test items into three content areas and one
   in which students ranked the appropriateness of the placement of
   test items into three content areas.

Many procedures are available for the first situation (two judges, simple
classification) and one has been extended to the second (more than two judges,
simple classification). No well-known procedures are available for quantifying
the degree of agreement among judges in the last two situations. This paper will

a. describe, illustrate, and compare two indices appropriate for
   simple classification,
b. develop a generalized analysis of variance for expressing the
   reliability of such judgments which is appropriate for all situations
   described above, and
c. illustrate the applications of the generalized technique to potential
   research situations.

AGREEMENT AND ASSOCIATION 

For the simplest judgment situation (two judges, simple classification),
there are a number of indices which express the strength of the relationship
between judges' classifications (that is, the degree to which the joint
frequencies differ from those estimated from the products of the marginal
frequencies). Some of these indices are the contingency coefficient (Guilford,
1956), the lambda coefficient and Guttman's lambda coefficient (Goodman and
Kruskal, 1954), which are useful as measures of association but do not elicit
values of agreement exclusively in the [-1, +1] range and thus are unacceptable
as indications of reliability. An index of agreement among judges should be
unity if and only if all judges agree exactly on the assignment of all subjects.
The expected value when there is no relationship between judges should be
approximately zero.

INDICES OF AGREEMENT

Simple Classification, Two Judges

When two judges classify each of n objects into one of c mutually exclusive
and exhaustive classes, the frequencies may be displayed as a c x c table.
Cohen (1960) illustrated the situation with an example in which two judges
(psychiatrists) placed subjects (patients) into three categories
(1 = schizophrenic, 2 = neurotic, and 3 = brain-damaged). Figure 1 illustrates
frequencies from a second example in the same article.

Insert Figure 1 about here 

Scott (1955) proposed a coefficient of intercoder agreement which involves
the frequency of agreement between judges. Since some agreements would be
expected to occur even under random classifications, Scott standardized the
frequency of agreement by the frequency expected by chance. The latter was
based on the squares of the average marginal frequencies. Specifically,

\[ \pi = \frac{\sum_{j=1}^{c} f_{jj} - \sum_{j=1}^{c} \left( \dfrac{f_{j.} + f_{.j}}{2} \right)^{2}}{1 - \sum_{j=1}^{c} \left( \dfrac{f_{j.} + f_{.j}}{2} \right)^{2}} \tag{1} \]

where $f_{jj}$ is the relative frequency with which judges agreed on placement in
category j, $f_{j.}$ and $f_{.j}$ are the relative frequencies with which the two judges
used category j, and n is the number of objects classified (the base of each
relative frequency); N represents the total number of observations. Using the
data from Figure 1,

\[ \pi = .487 . \]



Note that to obtain the relative frequency of agreement expected by chance,
Scott used the average marginal frequency for each category. In the event
that the marginal frequencies differ between judges, π cannot reach unity.
The expected value of π is near zero under the hypothesis of independent
(random) classifications.
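
As a minimal sketch of the computation just described, the following Python
function evaluates Scott's coefficient from a c x c table of joint counts.
The function name `scott_pi` and the 3 x 3 table are illustrative assumptions
(the table is not taken from Figure 1); the chance term uses the squared
average of the two judges' marginal proportions, as stated above.

```python
def scott_pi(table):
    """Scott's pi for two judges from a c x c table of joint counts.

    table[a][b] is the number of objects placed in category a by the
    first judge and in category b by the second judge.
    """
    c = len(table)
    n = sum(sum(row) for row in table)            # objects classified
    observed = sum(table[j][j] for j in range(c)) / n
    row = [sum(table[j]) / n for j in range(c)]   # first judge's marginals
    col = [sum(table[a][j] for a in range(c)) / n for j in range(c)]
    # Chance agreement: squares of the averaged marginal proportions.
    chance = sum(((row[j] + col[j]) / 2) ** 2 for j in range(c))
    return (observed - chance) / (1 - chance)


# Hypothetical table: 200 objects, 3 categories, unequal marginals.
example = [[88, 10, 2],
           [14, 40, 6],
           [18, 10, 12]]
print(round(scott_pi(example), 3))  # 0.487
```

Because the two judges' marginals differ in this table, even perfect use of
the diagonal could not bring the coefficient to unity.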

A later coefficient, kappa (κ), described by Cohen (1960), differs from π
only in the manner in which the relative frequency of agreement expected by
chance is computed. Cohen used the sum of cross products of the marginal
relative frequencies instead of the squared average marginals as in π.
Cohen's coefficient is

\[ \kappa = \frac{\sum_{j=1}^{c} f_{jj} - \sum_{j=1}^{c} f_{j.}\, f_{.j}}{1 - \sum_{j=1}^{c} f_{j.}\, f_{.j}} \tag{2} \]

Using the data in Figure 1,

\[ \kappa = .492 . \]



It is easy to establish that, for any set of data, π ≤ κ, with equality
holding only when the marginal frequencies for each category are identical
for both judges. The expected value of κ under random classification is near
zero (Everitt, 1968).
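
The relation between the two coefficients can be illustrated with a short
sketch. The function `agreement_coefficients` and both tables are hypothetical;
the only difference between the two chance terms is averaged-and-squared
marginals (π) versus cross products of marginals (κ).

```python
def agreement_coefficients(table):
    """Return (pi, kappa) for two judges from a c x c table of counts."""
    c = len(table)
    n = sum(sum(row) for row in table)
    observed = sum(table[j][j] for j in range(c)) / n
    row = [sum(table[j]) / n for j in range(c)]
    col = [sum(table[a][j] for a in range(c)) / n for j in range(c)]
    chance_pi = sum(((row[j] + col[j]) / 2) ** 2 for j in range(c))
    chance_kappa = sum(row[j] * col[j] for j in range(c))
    pi = (observed - chance_pi) / (1 - chance_pi)
    kappa = (observed - chance_kappa) / (1 - chance_kappa)
    return pi, kappa


# Hypothetical tables: judges' marginals differ in the first, match in the second.
unequal = [[88, 10, 2], [14, 40, 6], [18, 10, 12]]
equal = [[40, 5, 5], [5, 20, 5], [5, 5, 10]]

print([round(v, 3) for v in agreement_coefficients(unequal)])  # [0.487, 0.492]
print([round(v, 3) for v in agreement_coefficients(equal)])    # [0.516, 0.516]
```

The second table is symmetric, so both judges have identical marginals and
the two coefficients coincide.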

Simple Classification, Two or More Judges

To extend Scott's π coefficient for more than two judges, it is simply
necessary to use the squares of the c category marginal frequencies to obtain
the adjustment for "chance" agreements. Let $f_{ijk}$ be 1 if object k was classified
in category j by judge i and 0 if not, with r judges and s objects. The expected
relative frequency of agreement due to chance is $\sum_{j} f_{.j.}^{2} / (rs)^{2}$, where the dot
denotes a summation over the deleted subscript. The extension of π is then

\[ \pi = \frac{\dfrac{\sum_{k}\sum_{j} f_{.jk}^{2} - rs}{rs(r-1)} - \dfrac{\sum_{j} f_{.j.}^{2}}{(rs)^{2}}}{1 - \dfrac{\sum_{j} f_{.j.}^{2}}{(rs)^{2}}} \tag{3} \]

In a recent article, Fleiss (1971) proposed an extension of κ which is
identical to Equation 3. This extension is more appropriately considered
an extension of π because of the use of pooled or averaged marginal frequencies
to obtain the frequency of expected agreements due to chance.
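
A minimal sketch of this pooled-marginal extension follows. It assumes the
observed agreement is the proportion of agreeing judge pairs per object,
$(\sum_{k}\sum_{j} f_{.jk}^{2} - rs) / [rs(r-1)]$, and the chance term is the one given above.
The function name `extended_pi` and the small data set are hypothetical.

```python
def extended_pi(ratings, c):
    """Multi-judge extension of pi using pooled category marginals.

    ratings[i][k] is the category (0 .. c-1) chosen by judge i for object k.
    """
    r = len(ratings)       # number of judges
    s = len(ratings[0])    # number of objects
    # f_jk[j][k]: number of judges who placed object k in category j
    f_jk = [[sum(1 for i in range(r) if ratings[i][k] == j) for k in range(s)]
            for j in range(c)]
    f_j = [sum(f_jk[j]) for j in range(c)]   # pooled category totals
    squares = sum(v * v for per_cat in f_jk for v in per_cat)
    observed = (squares - r * s) / (r * s * (r - 1))
    chance = sum(t * t for t in f_j) / (r * s) ** 2
    return (observed - chance) / (1 - chance)


# Hypothetical data: 3 judges, 4 objects, 2 categories.
ratings = [[0, 0, 1, 1],
           [0, 0, 1, 0],
           [0, 1, 1, 1]]
print(round(extended_pi(ratings, 2), 3))  # 0.333
```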

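Figure 2 presents data obtained by Board and Whitney by having six judges
classify test items into five categories. For these data r = 6 and s = 20, so
that rs(r - 1) = 600 and $(rs)^{2}$ = 14,400, and the expected relative frequency
of agreement due to chance in the extension of π is

\[ \frac{\sum_{j} f_{.j.}^{2}}{(rs)^{2}} = \frac{3314}{14400} = .230 . \]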

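In order to extend κ properly, one should use the sum of the cross products
of pairs of marginal frequencies to obtain the expected agreements by chance
(Light, 1971). Thus, the extended κ coefficient would be

\[ \kappa = \frac{\dfrac{\sum_{k}\sum_{j} f_{.jk}^{2} - rs}{rs(r-1)} - \dfrac{\sum_{j}\sum_{i \neq i'} f_{ij.}\, f_{i'j.}}{r(r-1)s^{2}}}{1 - \dfrac{\sum_{j}\sum_{i \neq i'} f_{ij.}\, f_{i'j.}}{r(r-1)s^{2}}} \tag{4} \]

where the inner sum runs over all ordered pairs of distinct judges. This
equation would be very awkward as a computing method, but it can be
easily verified that Equation 5 is algebraically equivalent:

\[ \kappa = \frac{\dfrac{\sum_{k}\sum_{j} f_{.jk}^{2} - rs}{rs(r-1)} - \dfrac{\sum_{j} \left( f_{.j.}^{2} - \sum_{i} f_{ij.}^{2} \right)}{r(r-1)s^{2}}}{1 - \dfrac{\sum_{j} \left( f_{.j.}^{2} - \sum_{i} f_{ij.}^{2} \right)}{r(r-1)s^{2}}} \tag{5} \]

For the data from Figure 2, the observed-agreement term again has the
denominator rs(r - 1) = 600, and the chance term has the denominator
$r(r-1)s^{2}$ = 12,000.
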

Thus, as for the simpler case, κ ≥ π, although the difference for these data is
negligible. The difference between values increases as the judges' category
marginal frequencies become more disparate.
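
The equivalence of a pairwise form and a totals-only computing form can be
checked numerically. This is a sketch under the assumption that the extended
κ averages observed and chance agreement over all pairs of judges; the
function names and the data are hypothetical.

```python
from itertools import combinations


def marginals(ratings, c):
    """marg[i][j]: number of objects judge i placed in category j."""
    return [[sum(1 for cat in judge if cat == j) for j in range(c)]
            for judge in ratings]


def kappa_pairs(ratings, c):
    """Pairwise form: explicit sums over all pairs of judges."""
    r, s = len(ratings), len(ratings[0])
    pairs = list(combinations(range(r), 2))
    marg = marginals(ratings, c)
    agree = sum(1 for a, b in pairs for k in range(s)
                if ratings[a][k] == ratings[b][k])
    cross = sum(marg[a][j] * marg[b][j] for a, b in pairs for j in range(c))
    observed = agree / (len(pairs) * s)
    chance = cross / (len(pairs) * s * s)
    return (observed - chance) / (1 - chance)


def kappa_totals(ratings, c):
    """Computing form: only totals and sums of squares are needed."""
    r, s = len(ratings), len(ratings[0])
    marg = marginals(ratings, c)
    f_jk_sq = sum(sum(1 for i in range(r) if ratings[i][k] == j) ** 2
                  for j in range(c) for k in range(s))
    pooled_sq = sum(sum(marg[i][j] for i in range(r)) ** 2 for j in range(c))
    own_sq = sum(marg[i][j] ** 2 for i in range(r) for j in range(c))
    observed = (f_jk_sq - r * s) / (r * s * (r - 1))
    chance = (pooled_sq - own_sq) / (r * (r - 1) * s * s)
    return (observed - chance) / (1 - chance)


# Hypothetical data: 3 judges, 4 objects, 2 categories.
ratings = [[0, 0, 1, 1],
           [0, 0, 1, 0],
           [0, 1, 1, 1]]
print(round(kappa_pairs(ratings, 2), 3), round(kappa_totals(ratings, 2), 3))
# 0.385 0.385
```

For the same hypothetical data the pooled-marginal coefficient is 1/3, so the
ordering κ ≥ π holds here as well.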

General Methods

Assume that each of r judges assigns some value $x_{ijk}$ to represent the
proximity or resemblance of object k to category j. These data may be displayed
in an r x c x s matrix with each cell containing a single observation.
Following the usual ANOVA procedures, the total sum of squares ($SS_{tot}$) can be
partitioned as:

\[ SS_{tot} = SS_{R} + SS_{C} + SS_{S} + SS_{RC} + SS_{SC} + SS_{RS} + SS_{RCS} \tag{6} \]

Each of these sums of squares takes on a meaning directly related to the
concept of the reliability of judgments:

1. Sum of squares for raters or judges ($SS_{R}$) reflects the differences
   among judges of the average values assigned across objects and
   categories. It is analogous to the differences in raters' "levels"
   in a univariate rating task and would usually be considered as "error."

2. Sum of squares for categories ($SS_{C}$) reflects the differences among
   categories in the average values assigned across raters and objects.
   For a sufficiently large number of raters, this component simply
   describes the sample of objects. Like the sum of squares for items
   in the Hoyt (1941) procedure for estimating test reliability, this
   component represents neither "true" nor "error" variance.

3. Sum of squares for objects or subjects ($SS_{S}$) reflects the differences
   among objects in the average values assigned across judges and
   categories. It would usually represent "error" variance.

4. Sum of squares for judges-by-categories ($SS_{RC}$) reflects the differences
   among average values assigned by judges to each category. Again, this
   component would usually be considered to be "error" variance.

5. Sum of squares for objects-by-categories ($SS_{SC}$) reflects differences
   among average values assigned to each object-category combination.
   This, presumably, reflects "true" variance, since it is assumed that
   most objects "fit" one category better than the others.

6. Sum of squares for judges-by-objects ($SS_{RS}$) reflects differences
   among average values assigned by judges to each object. Again, this
   source usually reflects "error" variance.

7. Sum of squares interaction ($SS_{RCS}$) reflects residual variance, also
   an "error" component in reliability considerations.
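
The seven components can be computed with a short routine. This is a sketch
for a complete r x c x s array with one value per cell; the function name
`anova_partition`, the dictionary labels and the example weights are
assumptions made for illustration, and the residual is taken by subtraction.

```python
from itertools import product


def anova_partition(x):
    """Sums of squares for x[i][j][k]: judge i, category j, object k."""
    r, c, s = len(x), len(x[0]), len(x[0][0])
    cells = list(product(range(r), range(c), range(s)))
    grand = sum(x[i][j][k] for i, j, k in cells) / (r * c * s)

    def between(keep):
        """SS among the means of the groups defined by the kept axes."""
        totals, counts = {}, {}
        for idx in cells:
            key = tuple(idx[a] for a in keep)
            totals[key] = totals.get(key, 0.0) + x[idx[0]][idx[1]][idx[2]]
            counts[key] = counts.get(key, 0) + 1
        return sum(n * (totals[key] / n - grand) ** 2
                   for key, n in counts.items())

    ss = {"tot": sum((x[i][j][k] - grand) ** 2 for i, j, k in cells)}
    ss["R"], ss["C"], ss["S"] = between([0]), between([1]), between([2])
    ss["RC"] = between([0, 1]) - ss["R"] - ss["C"]
    ss["SC"] = between([1, 2]) - ss["C"] - ss["S"]
    ss["RS"] = between([0, 2]) - ss["R"] - ss["S"]
    # Residual (three-way) term by subtraction from the total.
    ss["RCS"] = ss["tot"] - sum(ss[t] for t in ("R", "C", "S", "RC", "SC", "RS"))
    return ss


# Hypothetical weights: 2 judges x 2 categories x 3 objects.
x = [[[9, 2, 5], [1, 8, 5]],
     [[8, 1, 6], [0, 9, 3]]]
for name, value in anova_partition(x).items():
    print(name, round(value, 2))
```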

Thus, if we follow the usual procedure for estimating the average reliability
for a single rater, we would use the general form

\[ r_{11} = \frac{MS_{subj} - MS_{error}}{MS_{subj} + (r-1)\, MS_{error}} \tag{7} \]

By taking $MS_{subj}$ to be $MS_{SC}$ and varying the definition of $MS_{error}$, we can
obtain an array of reliability coefficients reflecting desired sources of
"error." As an example, a coefficient which would reflect all sources of
"error" would be

\[ r_{11} = \frac{MS_{SC} - MS_{error}}{MS_{SC} + (r-1)\, MS_{error}}, \qquad MS_{error} = \frac{SS_{R} + SS_{S} + SS_{RC} + SS_{RS} + SS_{RCS}}{(r-1)(c-1)(s-1)} \tag{8} \]

where $MS_{SC} = SS_{SC} / [(c-1)(s-1)]$. Note that $MS_{C}$ is never included, since it
reflects neither "true" nor "error" variance.

This generalized ANOVA approach makes possible the estimation of the
reliability of categorical judgments for nearly any conceivable situation.
In addition, it identifies and estimates the consequences of the
various sources of "error" variance. There may also be cases in which formal
tests of hypotheses concerning these effects are desired, and they would be
possible using this framework. Finally, this approach allows for a simple
solution to certain problems involving missing data. In most cases, the relevant
terms will still be estimable even for less-than-complete data.
Application of the General Procedure to Simple Classification

When judges are instructed to classify each object into a single category,
of course all $x_{ijk}$ values become 0's or 1's. In this situation, the constraints
imposed cause $SS_{R}$, $SS_{S}$ and $SS_{RS}$ to be zero and $SS_{tot}$ to be $(c-1)rs/c$. Under this
condition, Equation 8 simplifies to

\[ r_{11} = \frac{SS_{SC} - (SS_{RC} + SS_{RCS})/(r-1)}{SS_{SC} + SS_{RC} + SS_{RCS}} \tag{9} \]

It can be readily verified that this equation yields a value exactly equal to
that resulting from Equation 3. Thus, this coefficient appears to have a
very solid analytical basis and probably represents the most useful approach
for simple classification problems.
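
The claimed equality can be checked numerically. The sketch below assumes the
simplified coefficient has the form
$[SS_{SC} - (SS_{RC} + SS_{RCS})/(r-1)] / (SS_{SC} + SS_{RC} + SS_{RCS})$ and compares it
with the pooled-marginal extension of π on hypothetical 0/1 data; all function
names are illustrative.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def anova_coefficient(ratings, c):
    """ANOVA-based coefficient for simple classification (0/1 values)."""
    r, s = len(ratings), len(ratings[0])
    R, C, S = range(r), range(c), range(s)
    x = [[[1.0 if ratings[i][k] == j else 0.0 for k in S] for j in C] for i in R]
    g = mean(x[i][j][k] for i in R for j in C for k in S)
    m_i = [mean(x[i][j][k] for j in C for k in S) for i in R]
    m_j = [mean(x[i][j][k] for i in R for k in S) for j in C]
    m_k = [mean(x[i][j][k] for i in R for j in C) for k in S]
    m_ij = [[mean(x[i][j][k] for k in S) for j in C] for i in R]
    m_jk = [[mean(x[i][j][k] for i in R) for k in S] for j in C]
    m_ik = [[mean(x[i][j][k] for j in C) for k in S] for i in R]
    ss_sc = r * sum((m_jk[j][k] - m_j[j] - m_k[k] + g) ** 2 for j in C for k in S)
    ss_rc = s * sum((m_ij[i][j] - m_i[i] - m_j[j] + g) ** 2 for i in R for j in C)
    ss_rcs = sum((x[i][j][k] - m_ij[i][j] - m_jk[j][k] - m_ik[i][k]
                  + m_i[i] + m_j[j] + m_k[k] - g) ** 2
                 for i in R for j in C for k in S)
    error = ss_rc + ss_rcs
    return (ss_sc - error / (r - 1)) / (ss_sc + error)


def pooled_pi(ratings, c):
    """Pooled-marginal extension of pi computed from category counts."""
    r, s = len(ratings), len(ratings[0])
    f_jk = [[sum(1 for i in range(r) if ratings[i][k] == j) for k in range(s)]
            for j in range(c)]
    observed = (sum(v * v for row in f_jk for v in row) - r * s) / (r * s * (r - 1))
    chance = sum(sum(row) ** 2 for row in f_jk) / (r * s) ** 2
    return (observed - chance) / (1 - chance)


# Hypothetical data: 3 judges, 6 objects, 3 categories.
ratings = [[0, 1, 2, 2, 1, 0],
           [0, 1, 2, 1, 1, 0],
           [0, 2, 2, 2, 1, 1]]
print(abs(anova_coefficient(ratings, 3) - pooled_pi(ratings, 3)) < 1e-9)  # True
```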

As an alternative, one could consider the average correlation for each
category between assignments to the category, across all possible pairs of
judges, averaged across all categories. Such a coefficient would have the
desirable feature of being interpretable as an "expected" correlation between
a pair of judges for any category. The numerator of the average of all
½cr(r-1) such correlations is proportional to $SS_{SC} - SS_{RCS}/(r-1)$. That suggests
the use of the coefficient

\[ r_{pooled} = \frac{SS_{SC} - SS_{RCS}/(r-1)}{SS_{SC} + SS_{RCS}} \tag{10} \]

Alternatively, this coefficient may be derived by analyzing each category
separately into three components ($SS_{objects}$, $SS_{judges}$, and $SS_{interaction}$)
and pooling (summing) these terms across categories before applying the
Hoyt (1941) procedure to the pooled sums of squares.

This coefficient will be larger than that resulting from Equation 9,
a finding consistent with the fact that the difference among category marginal
frequencies for judges is ignored in the computations. This coefficient,
however, would clearly have an expected value of zero under the hypothesis of
independent classifications.

It may be of interest to note that the extended κ coefficient of
Equation 5 can be expressed as

\[ \kappa = \frac{SS_{SC} - SS_{RCS}/(r-1)}{SS_{SC} + SS_{RCS} + r\, SS_{RC}/(r-1)} \tag{11} \]

The interpretation, in terms of reliability components, is not clear, at present,
for this coefficient. It is, however, interesting to note that the numerator
is identical with that for Equation 10 while the denominator is similar to that
for Equation 9.
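
The ordering of the three coefficients for simple classification can be
checked directly from their sum-of-squares forms. The following is a sketch
under the assumption that the coefficients take the forms written below; the
shorthand A, D and a is introduced here only for illustration.

\[ A = SS_{SC} - \frac{SS_{RCS}}{r-1}, \qquad D = SS_{SC} + SS_{RCS}, \qquad a = \frac{SS_{RC}}{r-1} \ge 0 \]

\[ \pi = \frac{A - a}{D + (r-1)a}, \qquad \kappa = \frac{A}{D + ra}, \qquad r_{pooled} = \frac{A}{D} \]

Whenever A ≥ 0, κ ≤ $r_{pooled}$ because the two share a numerator and κ has the
larger denominator. Cross-multiplying the first two expressions gives

\[ \pi \le \kappa \iff (A - a)(D + ra) \le A\,[D + (r-1)a] \iff a\,(A - D - ra) \le 0 , \]

which always holds because A ≤ D. Equality throughout requires a = 0, that
is, no judges-by-categories variation.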

ILLUSTRATION OF GENERAL METHODS

Figure 3 displays the data (Enger, 1975) for 3 educational psychology
students who assigned integer values 0-9 to 10 test items to reflect their
relationship to 3 content areas. The relevant ANOVA terms are:

    $SS_{tot}$ = 957.40        $SS_{SC}$ = 416.20
    $SS_{R}$   =   7.47        $SS_{RC}$ =  56.93
    $SS_{C}$   = 207.80        $SS_{RS}$ =  20.53
    $SS_{S}$   =  18.40

The residual term, obtained by subtraction, is $SS_{RCS}$ = 230.07.
Thus, the comprehensive coefficient of Equation 8 has the value

\[ r_{11} = \frac{23.12 - 9.26}{23.12 + 18.52} = .333 . \]

Other coefficients of interest are

\[ \pi = .388, \qquad \kappa = .412, \qquad r_{pooled} = .466 . \]
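
These values can be reproduced from the sums of squares alone. The sketch
below assumes the sum-of-squares forms of the four coefficients shown in the
code and takes the residual term by subtraction, since it is not listed among
the ANOVA terms; the function name and dictionary labels are hypothetical.

```python
def reliability_coefficients(ss, r):
    """Coefficients from sums of squares for r judges.

    ss maps the labels tot, R, C, S, SC, RC, RS to sums of squares; the
    residual (RCS) is obtained by subtraction, an assumption of this sketch.
    """
    rcs = ss["tot"] - sum(ss[t] for t in ("R", "C", "S", "SC", "RC", "RS"))
    all_error = ss["R"] + ss["S"] + ss["RC"] + ss["RS"] + rcs
    shared = ss["SC"] - rcs / (r - 1)   # numerator of kappa and r_pooled
    return {
        "comprehensive": (ss["SC"] - all_error / (r - 1)) / (ss["SC"] + all_error),
        "pi": (ss["SC"] - (ss["RC"] + rcs) / (r - 1)) / (ss["SC"] + ss["RC"] + rcs),
        "kappa": shared / (ss["SC"] + rcs + r * ss["RC"] / (r - 1)),
        "pooled": shared / (ss["SC"] + rcs),
    }


# ANOVA terms as listed for the three-judge illustration.
terms = {"tot": 957.40, "R": 7.47, "C": 207.80, "S": 18.40,
         "SC": 416.20, "RC": 56.93, "RS": 20.53}
result = reliability_coefficients(terms, r=3)
print({name: round(value, 3) for name, value in result.items()})
# {'comprehensive': 0.333, 'pi': 0.388, 'kappa': 0.412, 'pooled': 0.466}
```

The degrees of freedom cancel in each ratio, which is why only the number of
judges is needed.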

In order to evaluate the effects of restricting judges to a simple
classification (rather than the unrestrained weights), Enger forced a post-hoc
classification based on the highest category weight for each judge. Recomputed
using the classification "weights," he obtained

\[ \pi = .214, \qquad \kappa = .240, \qquad \text{and} \qquad r_{pooled} = .266 . \]

The dramatic decline in values for all coefficients probably reflects the loss
of information due to the restraints of classification. This suggests that
a greater number of judges are required to attain reliabilities for
classification tasks equal to those for unconstrained weights. In this example,
it would require about 2.4 times as many judges to achieve an $r_{pooled}$ value for
classification tasks equal to that for the weighting task.
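
The factor of about 2.4 is consistent with the Spearman-Brown formula, although
the text does not say how it was obtained; the sketch below assumes that
formula and $r_{pooled}$ values of .466 (weights) and .266 (classification). The
function name is hypothetical.

```python
def judges_multiplier(r_low, r_target):
    """Spearman-Brown: factor by which the number of judges must grow
    for a single-judge reliability of r_low to match r_target."""
    return r_target * (1 - r_low) / (r_low * (1 - r_target))


print(round(judges_multiplier(0.266, 0.466), 1))  # 2.4
```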

SUMMARY 

Generalized procedures have been described to facilitate the estimation
of reliability coefficients for a wide variety of classification tasks and
related multiple-category judgment decisions. This approach provides a means
for better identifying the sources of error variance and for testing hypotheses
concerning these sources.

Figure 1
Classification of 100 objects
to 3 nominal categories by 2 judges

Figure 2
Assignment of 20 test items
to 5 categories by 6 raters

Number of Items Assigned to Each Category by Rater